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Recognition of pathogens relies on families of proteins showing great diversity. Here we construct 
maximum entropy models of the sequence repertoire, building on recent experiments that provide 
a nearly exhaustive sampling of the IgM sequences in zebrafish. These models are based solely on 
pairwise correlations between residue positions, but correctly capture the higher order statistical 
properties of the repertoire. Exploiting the interpretation of these models as statistical physics 
problems, we make several predictions for the collective properties of the sequence ensemble: the 
distribution of sequences obeys Zipf's law, the repertoire decomposes into several clusters, and 
there is a massive restriction of diversity due to the correlations. These predictions are completely 
inconsistent with models in which amino acid substitutions are made independently at each site, 
and are in good agreement with the data. Our results suggest that antibody diversity is not limited 
by the sequences encoded in the genome, and may reflect rapid adaptation to antigenic challenges. 
This approach should be applicable to the study of the global properties of other protein families. 



I. INTRODUCTION 

The number of possible amino acid sequences exceeds 
the number of individual protein molecules that have 
ever been synthesized. As a result, the limited set of 
sequences that we see today carries a signature of evo- 
lutionary history [1 j. But not all of the limitations are 
historical — randomly chosen sequences will not fold into 
stable, compact structures [2j|3], and carrying out spe- 
cific functions places yet more requirements on the se- 
quence. Regardless of the balance between historical and 
functional constraints, the stochastic nature of evolution- 
ary change means that the sequences we observe should 
be thought of as being drawn out of a probability dis- 
tribution. The goal of this paper is to construct an ap- 
proximation to this distribution, using a limited but bi- 
ologically important example, the problem of antibody 
diversity. 

The ensemble of all proteins is daunting, so most work 
focuses on particular families of proteins. The most 
tractable examples are those in which the relevant seg- 
ments of the proteins are short, and experiments provide 
many independent samples of sequences from the fam- 
ily. For a family of small proteins that mediate protein- 
protein interactions, methods were developed to generate 
new sequences that are consistent with the patterns of 
single site substitutions and correlations between sub- 
stitutions at pairs of sites; remarkably, most of these 
new sequences fold into functional structures [4j [5] . Al- 
though this work did not lead to an explicit construc- 
tion of the underlying probability distribution, the im- 
plicit model is equivalent to a maximum entropy model 
that captures pairwise correlations but ignores higher or- 
der interactions [6], and thus connects to other efforts 
to describe biological networks with simplified models 
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\M HU H2]. Maximum entropy methods have 
since been used to look at protein-protein interactions in 
bacterial signaling [13], and at the serine proteases [T4] . 

A key feature of the maximum entropy approach is 
its intimate connection to statistical mechanics [TSJ [16] . 
Maximum entropy models predict the underlying proba- 
bilities in the form of a Boltzmann distribution, thus as- 
signing an effective energy to every amino acid sequence 
in our ensemble. Natural questions about this statistical 
mechanics problem have clear biological correlates: What 
is the entropy in sequence space, or equivalently the al- 
lowed diversity of functional proteins? Does the energy 
landscape break up into multiple valleys, corresponding 
to clusters of closely related proteins? Are the barriers 
between these valleys large, so that different clusters are 
isolated, or are there paths that can smoothly mutate 
one class of sequences into another? Are the interactions 
among substitutions at different sites strong or weak? 
Is is possible that these interactions are tuned to some 
special values, perhaps analogous to critical points in sta- 
tistical mechanics? Here we approach these problems in 
the context of antibody diversity. 

For antibodies, sequence diversity has a direct biolog- 
ical function, setting the range of antigenic challenges to 
which the organism can respond. Classical work has em- 
phasized the combinatorial diversity generated by piec- 
ing together different segments of the antibody molecule, 
each of which is encoded in the genome [17] • Very re- 
cently it has become possible to provide the sequences of 
essentially every single antibody molecule in individual 
organisms [18] , and this explosion of data invites us to 
look more closely at the diversity within the combined 
segments, beyond that represented in the genome itself. 
As we will see, for the zebrafish studied in Ref [18 , this 
non-genomic diversity is substantial, and concentrates 
in short segments of the molecule, the D regions of these 
molecules. This combination of focus on short sequences 
and a nearly complete sampling of the relevant ensemble 
provides a unique opportunity to address the theoretical 
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questions outlined above. 



II. DEFINING THE PROBLEM 

All jawed vertebrates are endowed with an adaptative 
immune system that responds to and 'remembers' a wide 
range of challenges from the environment. One major 
component of the immune system are the B cells, each 
of which expresses multiple copies of a single antibody 
molecule on its surface. Binding to these molecules is 
the fundamental step by which the system recognizes an 
antigen, and hence the diversity of these molecules de- 
fines the range of pathogens to which the organism can 
respond effectively [T9] , During the development of B 
cells, the genome is modified by recombination to en- 
code a single antibody sequence assembled from three 
pieces termed V, D and J. In the zebrafish [20] , there 
are 39 choices for the V region, 5 for D, and 5 for J, 
for a total of 975 possible VDJ combinations or classes. 
During recombination, nongenomic nucleotides are ran- 
domly added and others are removed at the VD and DJ 
junctions, generating what is called junctional diversity. 
Further, during the lifetime of the organism, the anti- 
body sequences encoded in proliferating B cells undergo 
somatic hypermutation. Finally, B cells that successfully 
bind pathogens proliferate, while B cells that are not used 
are eliminated. As a result, the expressed repertoire of 
antibodies is a complex combination of VDJ class, phy- 
logenic history and pathogen environment. 

The experiments of Ref [18] give us a snapshot of 
the complete antibody repertoire in each of fourteen ze- 
brafish, labelled A through N. More precisely, these ex- 
periments extracted the mRNA for the complementar- 
ity determining region 3 (CDR3) of the heavy chain of 
IgM molecules, reverse transcribed, amplified, and then 
sequenced the resulting cDNA using high throughput 
methods. It will be important in our analysis that the 
amplification step has biases, and so all averages over 
the distribution of sequences must be re-weighted by a 
primer dependent amplification, as discussed in [18] (see 
the appendix). Each fish yielded from 28,000 to 112,000 
sequence reads of ~ 200 nucleotides covering the last 90 
nucleotides of V, and all of D and J. 

The V and J segments of all the sequences are eas- 
ily recognized by aligning with the genome, discarding 
a small fraction of sequences with stop codons or frame 
mismatches. The situation for D regions is more subtle, 
and so we define the D region to be all the residues that 
lie between the identifiable parts of the V and J segments, 
as explained more fully in the appendix. 

We find that the D region is much more diverse than 
expected from its genomic origin, and concentrates most 
of the nongenomic diversity, as illustrated in Fig. [I] Most 
obviously, in the genome D regions range from 11 to 14 
nucleotides, while in the sampled sequences the D re- 
gion range from 1 to 6 amino acids (3 to 18 nucleotides; 
Fig [l]A.). If we try to match each sequence to one of 
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FIG. 1: Antibody diversity concentrates in the D re- 
gion (data from fish A). A Length distribution of the D re- 
gion aminoacid sequences. B. Quality of the assignment of 
the D region to a genomic template, measured by the differ- 
ence between the alignment scores of the first and second best 
matches (normalized by the best attainable score difference). 
C. Left: mutual information between residue positions across 
all VDJ classes. Right: mean mutual information within each 
VDJ class. Variability only remains in the D region. 



the genomic sequences, the quality of these assignments 
typically is quite poor (Fig[lj3). By using mutual infor- 
mation between residue positions as a measure of vari- 
ability within VDJ classes (see the appendix), we find 
that residues in the D region are both variable and cor- 
related even within a given D class, whereas the V and 
J regions show very little diversity within their classes 
(Fig[T]C). Junctional diversity, somatic hypermutations 
or other mechanisms may be the source of this nonge- 
nomic D variability, and could explain the poor quality 
of the D assignments. Independent of the mechanism, 
these results suggest that, in trying to define the distri- 
bution of sequences represented in the system, we should 
focus our attention on the D region. 

To be precise, we describe each observed D region se- 
quence as a = (<7i, <J2, • • • , <tl), where L is the length of 
the sequence. At each site along the sequence, <Ji can take 
on twenty different values, corresponding to the twenty 
possible amino acids (<Ji = Ala, Arg, Asn, . . . ). We would 
like to know the probability P(cr) that any particular se- 
quence will be found in the antibody repertoire of each in- 
dividual. The difficulty is that there are ~ (20) Lmax pos- 
sible sequences, where L max = 8 is the maximum length 
of the D region; in principle each sequence can occur with 
a different probability, and hence the number of possible 
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FIG. 2: Maximum entropy model. A. The model of 
the D region is viewed as a system of interacting residues 
(<ti, . . . , cfl) in thermal equilibrium, schematized here by its 
interaction network for K — 2. To each sequence cr is asso- 
ciated an energy E(a), Eq Then the sequences of the 
repertoires are drawn at random from the Boltzmann distri- 
bution , Eq ([2J. B. Pairwise frequencies of nearest- (red) 
and second-neighbor (yellow) residues. Left: comparison be- 
tween the model prediction, where the model was fitted with 
the training data, and the testing data. Right: direct com- 
parison between the training data and the testing data. The 
scatter is of the same magnitude, showing that the model is 
as precise as the data allow. 



sequences is also the number of parameters required to 
specify the distribution. This number, ~ 2.5 x 10 10 , is 
much larger than the number of independent measure- 
ments that we can make, and perhaps even larger than 
the number of B cells in the entire zebrafish at any one 
moment. How, then, can we make progress? 



III. MAXIMUM ENTROPY MODELS OF THE 
D REGION 

While experiments cannot characterize the entire dis- 
tribution P(cr), it is possible to make reliable measure- 
ments of many averages over this distribution. For ex- 
ample, we can characterize the probability that any sin- 
gle amino acid appears in the sequence, Pi(cr). Further, 
we can characterize the probability that two particular 
amino acids appear separated by a distance k along the 
sequence, P2(a, a'; k), and we can do this for nearest 



neighbors (k = 1), next nearest neighbors (k = 2), and 
so on. Notice that these quantities do not refer to spe- 
cific sites along the sequence, but rather to pairs of sites 
separated by given distances; in this way we can analyze 
sequences that have variable lengths and are difficult to 
align, as observed for the D regions. We could continue 
along this line, characterizing the probability of occur- 
rence of triplets, quartets, etc., but at some point we will 
run out of data. 

The central idea of maximum entropy models is to 
take some limited set of averages seriously as a charac- 
terization of the system, and then build the least struc- 
tured model for the distribution P(cr) that is consistent 
with these data [I5j [16] . Formally, minimizing structure 
means maximizing the entropy 

S[P] = -^P(<T)l0g 2 [P(<7)]. (1) 

cr 

Here we will find the maximum entropy distribution con- 
sistent with the single residue frequencies, Pi(cr), with 
the pairwise distributions of amino acids along the se- 
quence, P2(a, a'; k), and with the observed distribution 
of lengths of the D region, P(L). Finding this model dis- 
tribution, which we denote p( m ) , involves solving an op- 
timization problem (maximize S) subject to constraints 
(the observed distributions). Because of the connection 
between maximum entropy distributions and statistical 
mechanics, the form of the solution is well known. 

We can write p( m ) in the form of the Boltzmann dis- 
tribution, as if the sequences represented the state of a 
physical system in thermal equilibrium: 

p(-) = I eX p[-£7( < r)], (2) 
where the effective energy of each sequence is 

L K 

E(a) = -^L)-J2K^)-J2 E ^aj), (3) 

i=l k=l i,j 

i-J=k 

To complete the analogy to thermodynamics, we should 
think of the temperature as being such that fc#T = 1. 
Then fi(L) acts like a chemical potential for adding 
residues, h(a) is a biasing field that prefers some amino 
acids over others, and the couplings Jk describe the inter- 
actions between amino acids at different sites, reaching 
across a range K, as schematized in Fig [2}^. The ft's, 
J's and /i's must be chosen such that P( m )(L), p[ m ^ and 
P^ agree with the data. 

Calculating p( m )(L), P 1 (m) and P 2 (m) from the full dis- 
tribution P( m )(cr) is hard in general, and the inverse 
problem of inferring the model parameters from these ob- 
servables is clearly not easier. We solve the inverse prob- 
lem by combining Monte Carlo simulations with gradient 
descent (see the appendix). The number of parameters 
can be fairly large, 399K + 19 + P m ax ~ 10 3 , although 
vastly smaller than the number of possible parameters 
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FIG. 3: Local observables and the entropy are well cap- 
tured by the model. A. Position-dependent aminoacid fre- 
quency. Top, frequency as a function of position i = 1, . . . , 4 
from the left end of the sequence. Bottom: comparison be- 
tween model and data of position-dependent frequencies, nor- 
malized by the prediction of the independent model. Error 
bars are obtained as the standard deviation over many choices 
of partition between training and testing sets. B. Compari- 
son of triplet frequencies of contiguous aminoacids, normal- 
ized by the prediction of the independent model. The small 
crosses illustrate one choice of the training/testing partition. 
The black error bars represent the average measurement er- 
ror made on a triplet frequency at that frequency value, ob- 
tained as the standard deviation over many choice of the train- 
ing/testing partition. The diagonal error bars show the av- 
erage error between model and data. C. Entropy of all fish: 
from frequency counting, from the independent model, and 
from the maximum entropy model with range K — 1, . . . , 4. 



(20) Lmax . To test the validity of our method and control 
for overfitting, we learned the maximum entropy distri- 
bution from only one half of the sequences (training set). 
Then, the model predictions were compared to the sec- 
ond half of the data (testing set). We solved the inverse 
problem and tested our solution for all 14 fish and for 
different interaction ranges K = 1, 2, 3, 4. Our results 
showed excellent agreement with the data, as illustrated 
in Fig. for the pairwise frequencies in fish A (K = 2). 



IV. TESTING AND EXPLORING THE MODEL 

The maximum entropy model is the least structured 
model consistent with the observed pairwise correlations 
among amino acids, but of course there is no guarantee 
that Nature is described by this minimal model. To test 



the model, we look systematically at its predictions for 
measurable quantities that are not already used in deter- 
mining the model parameters. If we can convince our- 
selves that these predictions are at least approximately 
correct, we can take the model more seriously and ask 
what it tells us about the nature of antibody diversity. 



A. Local biases 

The model we have constructed does not incorporate 
any site specificity — interactions between amino acids de- 
pend on the distance between them but not on their ab- 
solute location along the sequence [Eq But, since 
amino acids at the start or end of the sequence have only 
half the number of neighbors that are available to sites 
in the middle of the sequence, the model predicts 'end 
effects' which will be manifest as position specific biases 
in amino acid composition. As shown in Fig [3]A., these 
predicted biases can be large, so that the probability of 
finding particular amino acids at specific sites, P^cr), 
can vary by more than two orders of magnitude. These 
predictions are in very good agreement with the data. 
We emphasize once again that these predictions of site 
specific substitution patterns are obtained from a model 
that has no explicit site specific information, and which 
is learned from an ensemble of sequences that have not 
been aligned. In a similar spirit, we find good agreement 
between the predicted and observed probabilities of con- 
tiguous amino acid triplets (Fig. [3^3), even though the 
model has no explicit three site interactions. 



B. Zipf's law 

The space of possible sequences is so large that we 
cannot test the predictions for the distribution P(cr) di- 
rectly. Still, we can get a global view of the distribution 
through a Zipf plot, in which we put the observed se- 
quences in order based on their frequency of occurrence, 
and plot probability P vs. rank r, as in Fig |4j We see 
that both the data and the predictions of the model are 
very close to obeying Zipf's law, P oc 1/r [2TJ [22] , and the 
data and model agree very well with one another. The 
same pattern is observed in all fish, although the rank- 
ing of particular sequences varies. The dynamic range 
over which we can observe Zipf's law is limited by the 
number of independent sequences that are read in the 
experiments, but the model predicts that this behavior 
should continue even if this number were extended by an 
order of magnitude. 

Zipf's law first attracted attention in the context of 
language [22 , and many models have been proposed for 
the origin of this behavior. Even before Zipf's work, 
it was known that some growth processes with muta- 
tions can generate Zipfian distributions [2TJ [23]. Since 
we have built a model out of measured pairwise correla- 
tions, with strong analogies to statistical mechanics, we 
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FIG. 4: The distribution of D regions obeys Zipf's law. 

Probability of D region sequences as a function of their rank 
in fish A, as observed from frequency counting (blue line), and 
as predicted by the independent (green line) and the maxi- 
mum entropy model with K — 2 (red line). The dashed line 
has slope —1. Inset: the same for all fish, from frequency 
counting. 

emphasize that Zipf's law reflects the proximity of a crit- 
ical point in the strength of the underlying interactions. 
The rank of a state a is determined by the number of 
states with higher probability, or lower energy in Eq 
But counting the number of states is equivalent to mea- 
suring the (microcanonical) entropy, and then Zipf's law 
is the statement that the entropy grows linearly with the 
energy, with slope one (see the appendix). This locally 
linear relation between energy and entropy is characteris- 
tic of thermodynamic systems at a critical point [24 , and 
could not emerge from a system of noninter acting units, 
or even from an interacting system with slightly weaker 
or stronger correlations. Thus, the strength of correla- 
tions that we see in the real sequences corresponds to 
interactions with a critical strength, restricting the set 
of allowed sequences substantially, but not forcing the 
system to 'freeze' into a small set of possibilities. 



C. Entropy 

The fundamental quantity in a maximum entropy con- 
struction is the entropy S itself. Entropy measures the 
diversity in sequence space, and hence is also a fundamen- 
tal quantity from a biological point of view. If we imag- 
ine that sequences are constructed by choosing amino 
acids at random, then the entropy could be as large as 
log 2 (20)bits per residue, or a total of ~ 15 bits for the 
average length D region. For almost all fish (F is an 
exception, and is excluded from further analyses), the 
observed biases in the use of the different amino acids do 
not reduce this very much; that is, if we choose amino 
acids independently at every site but with the observed 



frequencies, 

L 

P ind (<7) = P(£)II P i(^), (4) 

2 = 1 

then the entropy S^Pind] of this independent model is 
nearly log 2 (20) bits per residue. We can think of the max- 
imum entropy model as part of a hierarchy, in which the 
entropy is reduced every time we take account of addi- 
tional correlations [25[. As shown in Fig[3]C, the entropy 
is reduced significantly as we take account of correla- 
tions between neighboring amino acids, corresponding to 
K = 1 in Eq It is reduced further when we in- 

clude next-nearest neighbors (K — 2), and the reduction 
seems to plateau as we include more distant neighbors, 
K = 3, 4. Including all of these pairwise correlations 
pushes the total entropy well below 10 bits for all fish, so 
that out of tens of thousands of possible sequences, most 
of the distribution is concentrated in only a few hundred 
(~ 2 s ) sequences, and this is consistent with what we 
observe in the Zipf plots (Figj4|. This restriction of se- 
quence space is even more dramatic when we realize that, 
given the maximum length of the D regions, there really 
are tens of millions of possible sequences. 

The difference between the entropy of the independent 
model and the true entropy, I = S[P{ n d] —S[P], measures 
the overall strength of correlations in the system, and 
is called the multi-information. The maximum entropy 
model predicts a value for /( m ) = S[P ind ] -S[P^] which 
must be smaller than /, and the ratio /( m )/J meausres 
the fraction of the correlated structure that we capture in 
our model. The difficulty is that because sequence space 
is large, estimating the entropy S[P] is difficult. Methods 
are available, however, which allow us to estimate S[P] 
even when we don't have enough samples to accurately 
estimate P(cr) itself, as explained in the appendix and 
[26] . Using these methods, we find, as shown in Figj3p, 
/( m ) jl in the range from 0.67 to 0.91 across the different 
fish. Thus our maximum entropy model, based only on 
pairwise correlations, captures between two thirds and 
ninety percent of all the correlated structure in the dis- 
tribution of sequences. 

D. Comparison between fish 

The analysis of entropies shows that the repertoires 
of individual fish span only a tiny fraction of the possi- 
ble sequence space. Do the repertoires of different fish 
overlap with each other, or are they distinct? To an- 
swer this question, we first computed a similarity factor 
Sim[P a ,P/3] between repertoire distributions (see the ap- 
pendix). This factor takes values between and 1 and 
measures the difficulty of guessing to which of the two 
repertoires (a or (3) a given sequence belongs. Figure [5^ 
shows the similarity factor for all pairs of fish, as calcu- 
lated by the maximum entropy model (see the appendix) . 
While the choices of V, D and J segments are correlated 
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FIG. 5: Fish repertoires overlap yet are specific. A. 

Table of similarity factors between all pairs of fish. The sim- 
ilarily factor is defined as mhiA Pa(&) 1 ~ X Pp{&) X , where 
P a and P/3 are the probability distributions of the D regions 
in fish a and (3. The thick red squares show the families. 
B. Mutual information between fish and sequence vs. the 
entropy of fish. Each point is a subgroup of all 13 fish (ex- 
cluding fish F), color-coded by its size (from dark blue to red). 
Filled circles are averages over groups of each size. Upper In- 
set: comparison between mutual information estimated from 
counting observed sequences, and that predicted by the max- 
imum entropy model. Lower Inset: Mutual information vs. 
fish entropy as predicted by the independent model. 



with the family relations among the fish [T8] , this mea- 
sure of similarity among D regions is not. 

To study repertoire specificity beyond two fish, we 
looked at the mutual information between the identity a 
of a fish and the sequence a of a single antibody molecule, 



I (or, a) ^P(<t, a) log 2 



P(<r,a) 



[P(a)P(a)\ 



(5) 



where P(<r, a) is the probability that a sequence picked 
at random in the dataset be a and come from fish a. 
Figure [5)3 represents this mutual information as a func- 
tion of the fish entropy S a = — J2 a P(ot) log 2 [P(a)] for 
many subgroups of fish of various sizes. The fish entropy 
is an upper bound to the mutual information, and is only 
reached when sequences give perfect information about 
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FIG. 6: Metastable states (data from fish A). A Lower: 
scores of pairwise alignments between the genomic segments 
D1-D5 and the metastable states. The bar plot represents 
the total weight of the basins of attraction of each metastable 
state. Upper: scores of alignments of the genomic segments 
with themselves and with each other are shown for compar- 
ison. B. Basins of attractions of the seven most populated 
states. A density plot represents the energy of the sequences 
vs. the number of steps separating them from their metastable 
state by steepest descent. C. Connectivity of the sequence 
space. Lines indicate the existence of paths of adjacent se- 
quences between two metastable states. When the link is a 
solid line, there exists a path made only of single- nucleotide 
mutations. 



which fish they came from, i.e. when each sequence be- 
longs to one fish uniquely. Although the mutual informa- 
tion remains far from this upper bound, it keeps growing 
linearly with the entropy as the size of the group is in- 
creased, each fish adding its own unique diversity; the 
slope of information vs. entropy is roughly 0.5, so that 
half of the diversity is unique to each individual and half 
is shared across the population. Importantly, this indi- 
viduality of the sequence ensembles depends dominantly 
on correlations, since in the independent model, Pi n d(c r ) 
from Eq Q, the mutual information between identity 
and sequence is roughly a factor of four smaller (inset to 
Fig [5^3). All 13 fish do not suffice to cover the potential 
diversity of D regions, as evidenced by the absence of 
saturation. 



E. Multivalley landscape 

The energy function in Eq ([3| includes competing 
interactions — the couplings J can be positive or nega- 
tive, favoring both correlated and anti-correlated amino 
acid substitutions at different sites. From the statisti- 
cal mechanics of disordered systems [27] we know that 
such competition can lead to 'frustration' and many 
metastable states. A metastable state is defined as a lo- 
cal mimimum of the energy landscape or, in probabilistic 
language, a local maximum of the probability distribu- 
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tion. Does this happen in the case of antibody diversity? 

Our model assigns an energy to every sequence, but 
to find local minima in this landscape we need to define 
'local.' Since mutations occur at the level of nucleotides, 
we work in the space of nucleotide sequences; to assign 
a (free) energy to nucleotide sequences we translate to 
amino acids, compute E(a) from Eq (J3|, and add a cor- 
rection term for the entropy of codon usage. Then we say 
that two sequences are adjacent if: (i) they differ by one 
nucleotide; or (ii) they differ by one nucleotide insertion 
and one deletion; or (in) they differ by three insertions 
or three deletions; the last criterion is necessary because, 
by construction, the lengths of D regions is a multiple 
of 3. With this conservative definition, we find ~ 10 lo- 
cal minima per fish; examples are shown in Fig [6] Some 
of these states correspond to the D regions encoded in 
the genome, as shown in Fig [6]A., but many do not. The 
structure of the energy landscape, and hence the proba- 
bility with which sequences appear in the organism's an- 
tibody repertoire, thus has elements which are not simply 
a record of genomic history, but presumably reflect rapid 
adaptation to the antigenic environment. 

Each metastable state defines a basin of attraction or 
valley in the energy landscape, and we can assign each se- 
quence to its corresponding valley by moving 'downhill:' 
starting from a given sequence, go to the lowest energy 
neighbor, and continue doing so until the energy stops 
decreasing and a metastable state has been reached. Fig- 
ure [6^3 represents the energy of all sequences in a basin 
of attraction as a function of their distance (in num- 
ber of steps) to the metastable states; although there 
are differences of detail, the different basins have very 
similar structures. As we explore away from the mini- 
mum energy in each basin, at some point we reach the 
'pass' which connects neighboring valleys; the trajecto- 
ries over these passes are analogous to the trajectories 
from reactants to products in a chemical reaction, with 
the pass identifiable as the transition state [28] . Since the 
sequences are not too long, we can find these paths by a 
conventional Monte Carlo procedure (see the appendix), 
and in most cases we found continuous paths through 
adjacent observed sequences between metastable states. 
When the two metastable states had the same length, 
we found paths where each step was a single nucleotide 
mutation. Figure [6p summarizes the connections among 
the seven most populated metastable states in the reper- 
toire of fish A. Taken together, these results on the en- 
ergy landscape imply that the repertoire explores much 
of the sequence space, and is not slaved to the genomic 
templates or to any specific sequence arising in the adap- 
tation process. 



V. SUMMARY AND DISCUSSION 

The formation of the antibody repertoire is an exam- 
ple of an accelerated evolutionary process under selective 
pressure. Antibodies in a given organism are correlated 



both through their genomic origin and as a result of the 
adaptation history. In this study we have analyzed the 
repertoire of B cell antibodies by building compact mod- 
els of the hypervariable region of their heavy chain, based 
on the principle of maximum entropy. 

The reduction of parameters achieved by the model is 
enormous. Even though we are looking at the relatively 
short hypervariable D regions, there are tens of millions 
of possible sequences, and in principle each sequence oc- 
curs with a different probability in the repertoire. In 
constrast, the number of parameters of our model is of 
order 400K, where K is the interaction range. Impor- 
tantly, this number scales reasonably with sequence size, 
making our approach tractable for systems in which the 
relevant sequence is much longer, including the hyper- 
variable regions in other species. The compactness of the 
model allows for generalization, so we can predict quanti- 
ties that are not deducible simply by counting sequences 
in the observed sample: the overall size of the repertoire, 
the overlaps between repertoires of different individuals, 
and the probability of finding new, as yet unobserved, 
sequences in larger samples from the same individual. 

The maximum entropy construction accounts for cor- 
relations between amino acid substitutions at different 
residue positions through an effective interaction struc- 
ture. These interactions are strong enough to generate a 
dramatically different ensemble of sequences than would 
be expected if substitutions at each site were indepen- 
dent. The diversity of the repertoire is substantially re- 
duced (from an entropy of ~ 14 bits to ~ 8 bits), the 
distribution of sequences obeys Zipf's law, and the dis- 
tribution has a complex structure of 'metastable states,' 
clusters of sequences with high probability. 

We have addressed the question of individuality, using 
our model and tools from information theory. At one 
extreme, the fish could be completely different, and each 
new fish would bring a whole new set of unique sequences. 
At the other extreme, fish could have more or less identi- 
cal repertoires, sharing the same antibodies in the same 
proportions. We found an intermediate situation, where 
about 50% of the repertoire diversity was unique to each 
fish, and the rest shared among all fish. As one concate- 
nates the individual repertoires, including more and more 
fish, the size of the resulting meta-repertoire must satu- 
rate, since the number of possible antibody sequences is 
finite. But this saturation is not reached even for thirteen 
fish, meaning that each fish is still unique compared to 
all other twelve taken together, and not only compared 
to each of them separately. 

The details of the adaptation process undergone by 
the repertoire are largely unknown, and our model only 
provides a first step to aid in its study. What is the 
mutation mechanism? How do recognition and selection 
work? Our observation of Zipf's law provides an impor- 
tant constraint on these mechanisms. As we have empha- 
sized, this behavior arises only if the interactions between 
substitutions at different sites have a critical strength. 
But these interactions are just a summary of the muta- 
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tion and selection dynamics. There are simple growth 
processes with mutation that can generate Zipfian distri- 
butions [23], but much work remains to find a realistic 
model that generates the full structure of P(cr). 

The structure of the energy landscape underlying our 
model shows that the repertoire decomposes into sev- 
eral components. Each component is centered on a 
metastable state, a peak in the probability distribution of 
sequences. Some metastable states are closely related to 
the genomic templates, although rarely identical, while 
others are not attributable to any genomic template. We 
can think of these metastable states as markers of adap- 
tation. For example, an infection could have caused the 
proliferation of antibodies particularly efficient for rec- 
ognizing a specific antigen, thus creating a peak in the 
probability landscape. This suggests the possibility of 
using metastable states and their basins of attraction for 
probing infectious history, perhaps in experiments that 
follow the dynamics of the sequence ensemble over time. 

The clusters associated with the metastable states are 
not completely disconnected from one other: we found 
continuous paths of observed sequences between most 
metastable states. This means that, far from being slaved 
to their genome, the D sequences have the freedom to 
explore sequence space extensively during the adaptation 
process, forming a large cloud of possibilities between the 
higly concentrated regions of the sequence space, i.e. the 
metastable states, whether they be genomic or not. The 
method we have used for finding these paths — a Metropo- 
lis walk in energy space — further illustrates the power of 
the maximum entropy model: since it naturally favors 
low energy barriers, this algorithm is more likely to find 
paths where all sequences are present in the data. More 
generally, it could be used as a tool for retracing muta- 
tion paths between any two sequences, and could lend us 
insight into the repertoire's evolutionary history. 

Finally, the success of maximum entropy models in 
accounting for the higher order statistical structure of 
the sequence ensemble encourages us to think that this 
approach is more widely applicable. The maximum en- 
tropy formalism shows how, as in many statistical physics 
problems, the observable correlations between amino acid 
substitutions at any two sites provide the signatures of 
collective behavior in the system as a whole. The idea 
that crucial aspects of life should be viewed as emergent, 
collective phenomena has been discussed for decades. 
The challenge has been to move beyond metaphor by de- 
veloping precise mathematical tools for extracting quan- 
titative models of this collective behavior from experi- 
ment. We believe that we have taken useful steps in this 
direction in the work reported here. 
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APPENDIX 
A. Aligning sequences 

The dataset [18] consists of a list of ~ 200 nucleotide 
long sequences for each fish. Each read is aligned to a V 
and J genomic template using the Smith- Waterman al- 
gorithm [29] with uniform scoring matrix (1 for a match 
and -1 for a mismatch), and then we isolate the subse- 
quence that starts after the last nucleotide aligned to V, 
and ends before the first nucleotide aligned to J. This 
subsequence is then aligned with each of the five D tem- 
plates, and assigned to the template that gave the best 
alignment score. To estimate the quality of this assign- 
ment, we calculate the score difference between the first 
and second best matches, divided by the same quantity 
assuming that the sequence is exactly identical to its ge- 
nomic template. This ratio, which we call discriminabil- 
ity, is when the two best genomic matches have the 
same score, and 1 when the sequence to assign is identi- 
cal to a genomic template. Fig.[lj3 shows the distribution 
of the discriminability ratio across all sequences in fish A. 

Reads that have a stop codon, or for which the read- 
ing frames of the V and J segments are not congruent, 
are discarded. The remaining reads are translated into 
aminoacid sequences, which are aligned with their V and 
J genomic aminoacid sequences. Our analysis focuses on 
the D-region aminoacid subsequence, which is composed 
of the residues located strictly between the last aligned 
residue of V and the first aligned residue of J. The length 
distribution of the obtained D regions is shown Fig. [I]A 

For the purpose of measuring statistical quantities, 
each sequence read a is assigned a weight w(cr) inversely 
proportional to the PCR bias associated to its sequence 
primer [18] . Thus the mean of any observable O(a) will 
be estimated using: 



(O) 



(6) 



B. Diversity across and within VDJ classes 

To estimate the variability and correlations among 
residues, we calculated the mututal information between 
pairs of residues at positions i and j of the sequence, 
which measures the degree of correlation between two 
positions, and is defined by: 



(7) 



where P[(cr) is the probability of having residue a (= Ala, 
Arg, Asn, . . .) at position i, and P{](a, a 1 ) the probabil- 
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ity of having a at position i and cr' at position j . These 
probabilities are estimated by frequency count, weighted 
by the primer dependent PCR bias as in Eq ([6|. For the 
purpose of comparing sequences at the same positions, 
we performed a multiple alignment of all the sequences. 
For each read, the V and J regions were aligned to their 
genomic template. Then, all 39 V and all 5 J genomic 
templates were aligned with the multiple sequence align- 
ment software ClustalW. Since we could find no satisfac- 
tory multiple alignment of the D regions or even the 5 D 
templates, we arbitrarily aligned all D regions by their 
first amino acids. 

The left panel of Fig. [Tp shows that antibody se- 
quences indeed are highly variable and show large pair- 
wise correlations, in agreement with the recombination 
scenario. Diversity in the choice of the V, D, J genomic 
segments induces variability at each residue position (as 
measured by the diagonal terms of the mutual infor- 
mation matrix, i.e. the single-position entropies), and 
these variables are themselves correlated through their 
common genomic cause (as measured by the off-diagonal 
terms). 

We can estimate the part of diversity that is not due 
to recombination by calculating the average mutual in- 
formation within VDJ classes: /^. cond ^ = (I^(VDJ))vdj 
(where L^iVDJ) is the mutual information within VDJ), 
shown in the right panel of Fig[lp. This mutual informa- 



tion is small throughout the V and J segments, indicat- 
ing that the choice of the V and J templates is the main 
source of diversity in these regions. In contrast, the D 
region remains diverse even within D classes. 



C. Solving the maximum entropy model 

For a set of parameters (/i, /i, J), the observables 

p( m )(L), P[ m \ and P 2 (m) are estimated by the Metropolis 
algorithm. Starting from the independent model as ini- 
tial condition: fi(L) = logP(L), h(o~) = logPi(a), and 
Jk = 0, we implement the following update rules: 



H(L) 
h(a) 



H(L) + ei log 



h(a) + e 1 log 



P( m )(L)' 

p< m V)' 



(8) 
(9) 



Ma,r) + e 2 log *£' r ' k \ . (10) 



(cr,r;/c) 



The last of these three rules is implemented every 5 steps, 
while the first two rules are implemented the remaining 
four steps. We set e\ = 0.005 and 62 = 0.01. 



J 



D. Entropy calculations 

To calculate the entropy of the probability distribution P, we need an estimate of the number P(<x) for each D 
region sequence cr. We estimate this number using Eq. [6j The distribution is undersampled due to the large number 
of possible sequences compared to the actual number of reads, and therefore we expect some systematic error. To 
reduce that error we compute the entropy with the method described in [26] , which extrapolates the infinite-sample 
limit from finite-sample estimates. 

The entropy of the model distribution p( m ) can, on the other hand, be calculated with arbitrary precision using 
thermodynamic integration. We define 



K 



fJL(L)+ J2 h (°i)+Pj2 J k(<Ti,<Tj) 



k=l 



-J=k 



(11) 



such that Z(l) = Z, and Z(0) = El e/i(L) [£* eHa) ] • 
We obtain Z by calculating the following integral numer- 
ically: 



logZ(l) =lo g Z(0)+ / 
Jo 



\ 1 dlogZ(p) 



d(3 



dp 



with 



d\agZ{fl) 
dp 



/Z (^kOi,^))^, 



i-j=k 



(12) 



(13) 



where the mean (-)p is taken with weights given by Eq 
(11). This quantity can be computed by the Metropolis 



algorithm for each (3. Finally, the entropy is given by: 

S[P^}= log Z+(E{&)) P=U (14) 
where the second term is also computed by Metropolis. 

E. Zipf's law and criticality 

We are describing the distribution of states cr in the 
Boltzmann form, Eq Then a natural quantity is the 
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density of states, 



so Zipf 's law implies 



p(E) = J25[E-E(a)}. 



(15) 



In the limit of a large system, p(E) becomes smooth, so 
that n(E) = p(E)A is the number of states with energy 
E, assuming we have some finite resolution A. The log 
of this number is the entropy, lnn(E) = S(E). Further, 
in the thermodynamic limit we expect that both entropy 
and energy are extensive variables, so we can define the 
energy and entropy per degree of freedom, 



e = E/N 
5(e) = S(E = TVe)/TV, 



(16) 
(17) 



where TV is the number of degrees of freedom. It's hard 
to "measure" the density of states p(E), but it's easier to 
think about the number of states with energy less than 
E, which we'll call Af(E). By definition, 



Af(E) = / dE f p(E f ). 



At large TV, we then have 



N(E) = i J dE' exp[S(E)} 



(18) 

(19) 
(20) 



- J de' exp[Ns(e')] 
N eNs{E/N) J™ dx e - Ns \E/N )x (21) 

(22) 



= L c Ns(E/N) 

As' (E/N) 

But the rank of state er is exactly the number of states 
with energy smaller that E(a), 

ra = N [E = E(a)} . (23) 

With our expression for Af we can write 

ln(ro-) = Ns [E(cr)/N] - In [As' (E/N)} , (24) 

which is dominated by the first term at large N. Then 
Zipf's law, P(er) = A/ra means that 

In P(a) = -Ns [E(a)/N] + In [AAs'(E/N)} . (25) 

But from the Boltzmann distribution we have 

In P(a) = -E(a) - In Z, (26) 



Ns(E/N) = S(E) = E- 



(27) 



where • • • denotes terms independent of E or terms which 
vanish relative to S in the large TV limit. 



F. Similarity factor 

The similarily factor between two distribu- 
tions P a and P/3 is defined as Sim[P a , Pp\ = 
mim\ P a (cr ) 1 ~ x P/3(cr ) A . The similarity factor 
appears naturally in the asymptotic error of the fol- 
lowing discrimination task. Suppose that we know P a 
and P/3, which describe the repertoires of two fish a 
and /?, and that we are given N sequences cr 1 , . . . , cr N , 
which either all come from fish a, or from fish (3. What 
is the probability of attributing the sequences to the 
wrong fish, and how does it decay with the number of 
observations TV? 



The log-likelihood for TV sequences {c 



ing from a is ^f =1 log P a (a l ), and likewise for f3. One in- 
fers that the sequences came from (3 if the log-likelihood 
for (3 is larger, and vice-versa. Thus the probability of 
error is, assuming the sequences came from a: 



} 



N 



P(error) 



1 ,...,(T N i=l 



N 



E lo s 



(28) 



Using the integral representation of Q(x) and a saddle- 
point estimate, we find the asymptotic behavior of the 
error probability for large TV: 



(error) 



mm 

A 



\1-A 



N 



(29) 



which is symmetric in a and (3, and is exactly 
(Sim[P a ,P }f. 

For each A, the sum P a (cr) 1 ~ x Pp(cr) x can be com- 
puted directly from frequency counts, or, in the case of 
the model distribution, by thermodynamic integration as 
explained above. The minimum over A is obtained by a 
line minimization search. 
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